cd "/disk/agedisk3/medicare.work/chandra-DUA52080/pragya-dua52080/replication_scrap/"


* create zip code to latitude/longitude file

capture log close
set more off
log using reduce.log, replace 
clear

* process SAS data
use /disk/agedisk3/medicare.work/chandra-DUA52080/pragya-dua52080/mlearn/intermediate/zip2latlon/src_sasdata/zipcode_13q1_unique.dta
keep zip y x
rename y lat
rename x lon

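* 5 zips have 0 for both lat and lon, which cannot be real US coordinates; drop them
drop if lat==0 & lon==0
* after the drop, no remaining zip should have a zero in either coordinate
assert lat!= 0 & lon!=0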
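
* Optional sanity check (a sketch, not part of the original script): report,
* without stopping the run, how many coordinates fall outside the valid
* latitude/longitude ranges. Missing values are counted too, because
* inrange() returns 0 for them.
count if !inrange(lat, -90, 90) | !inrange(lon, -180, 180)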

isid zip
sort zip
tempfile zip_sas
save `zip_sas'

* process GeoNames data
clear
insheet using /disk/agedisk3/medicare.work/chandra-DUA52080/pragya-dua52080/mlearn/intermediate/zip2latlon/src_geonames/US.txt
keep v2 v10 v11
rename v2 zip
rename v10 lat
rename v11 lon

* some zips are listed multiple times in this data (possibly because they span
* multiple cities), which gives the zip multiple latitude/longitude coords.
* of the 43 duplicated zips, all but one were in the SAS data.
* since we prefer the SAS coordinates, rather than go through the effort of
* computing an average latitude/longitude for the duplicates, just drop
* every copy of each duplicated zip
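* Optional check (a sketch, not part of the original script): tabulate how
* many zips appear more than once before they are dropped, so the number of
* duplicated zips noted above can be compared against the current GeoNames file.
duplicates report zip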
duplicates tag zip, gen(t)
drop if t>0
drop t

isid zip
sort zip
tempfile zip_geonames
save `zip_geonames'

* bring together two datasets
use `zip_sas'
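* SAS data takes precedence: without the update option, merge keeps the
* master (SAS) lat/lon for matched zips, so GeoNames coordinates are used
* only for zips that are absent from the SAS data
merge 1:1 zip using `zip_geonames'
* src_sas = 1 if the coordinates come from the SAS data, 0 if from GeoNames
gen byte src_sas = (_merge==1 | _merge==3)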
drop _merge
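
* Optional summary (a sketch, not part of the original script): show how many
* zips take their coordinates from each source (1 = SAS, 0 = GeoNames).
tabulate src_sas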

sort zip
capture noisily mkdir Intermediate_Output_Not_Exportable/
* replace lets the script be re-run without failing when zip.dta already exists
save Intermediate_Output_Not_Exportable/zip.dta, replace

log close
